Names: Ronen Saviz, Shalev Saban, Itsik Temnov.
Up/Down
# If this is your first run, please install pandas_datareader, plotly, and vaderSentiment.
!pip install pandas_datareader
!pip install plotly
!pip install vaderSentiment
Successfully installed pandas-datareader-0.10.0
Successfully installed plotly-5.8.0 tenacity-8.0.1
Successfully installed vaderSentiment-3.3.2
Imports:
#imports
import time
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import pandas_datareader as pdr
from datetime import date, timedelta
from urllib.request import urlopen, Request
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#imports sklearn
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression, LinearRegression
#imports pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, FloatType, BooleanType, IntegerType
# The parallel cells below expect a SparkSession (`spark`) and a SparkContext (`sc`);
# create them here if the environment does not already provide them.
spark = SparkSession.builder.appName('UpDown').getOrCreate()
sc = spark.sparkContext
# Select your stock and date for analysis.
company_name = 'Facebook'
company_symbol = 'FB'
check_date = "27/08/2014"
Scraping the latest news from Finviz:
# Finviz url.
url = 'https://finviz.com/quote.ashx?t=' + company_symbol
# Requesting url.
req = Request(url=url, headers={'user-agent':'my-app'})
response = urlopen(req)
# Parsing the HTML with BeautifulSoup.
html = BeautifulSoup(response, 'html.parser')
# Locating the news table by its ID.
news_table = html.find(id='news-table')
# Finding all the table rows.
news_rows = news_table.findAll('tr')
# Parsed rows accumulate here.
parsed_data = []
date = None
# Getting the text from each row; a row holding only a time reuses the last seen date.
for counter, row in enumerate(news_rows):
    title = row.a.text
    publisher = row.span.text
    date_data = row.td.text.split(' ')
    if len(date_data) == 1:
        hour = date_data[0][0:7]
    else:
        date = datetime.datetime.strptime(date_data[0], '%b-%d-%y').strftime('%d/%m/%Y')
        hour = date_data[1][0:7]
    # Adding the row into the parsed_data list.
    parsed_data.append([counter, title, date, hour, publisher])
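The scraping loop above can be exercised against a small inline snippet. The markup below is a hypothetical stand-in for Finviz's news table (the real page differs in detail); it only assumes each `<tr>` holds a timestamp `<td>`, a headline `<a>`, and a publisher `<span>`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the Finviz news table.
snippet = """
<table id="news-table">
  <tr><td>May-10-22 05:05PM</td><td><a>Meta expands renewables</a><span>CNW Group</span></td></tr>
  <tr><td>04:37PM</td><td><a>Musk on Twitter ban</a><span>Quartz</span></td></tr>
</table>
"""
html = BeautifulSoup(snippet, 'html.parser')
# Same navigation calls as the real loop: find by ID, then iterate rows.
rows = html.find(id='news-table').findAll('tr')
titles = [row.a.text for row in rows]
```

The second row has a time-only `<td>`, which is exactly the case the `len(date_data) == 1` branch handles.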
# Reading the data from CSV.
data = pd.read_csv('uci-news-aggregator.csv')
# Print count of each value before the cleaning.
print("Empty values in each column:")
data.isnull().sum()
Empty values in each column:
ID           0
TITLE        0
URL          0
PUBLISHER    2
CATEGORY     0
STORY        0
HOSTNAME     0
dtype: int64
# Drop selected features.
def drop_features(data, features):
    return data.drop(features, axis=1)
# Filling empty values in the given column with 'U' (Unknown).
def filling_empty_value_in_unknown_val(data, value):
    data[value] = data[value].fillna('U')
    return data
# Clean the data by dropping the columns the model does not use.
features = ['URL', 'STORY','HOSTNAME']
data=drop_features(data,features)
data=filling_empty_value_in_unknown_val(data,'PUBLISHER')
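The two cleaning steps can be sketched on a toy frame (values made up); the drop and the fill are inlined so the snippet is self-contained:

```python
import pandas as pd

# Toy frame standing in for the UCI news data.
toy = pd.DataFrame({
    'TITLE': ['a', 'b'],
    'URL': ['u1', 'u2'],
    'PUBLISHER': ['Reuters', None],
})
toy = toy.drop(['URL'], axis=1)                     # drop_features(toy, ['URL'])
toy['PUBLISHER'] = toy['PUBLISHER'].fillna('U')     # fill missing publishers with 'U'
```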
# Print count of each value after the cleaning.
print("Empty values in each column after the cleaning:")
data.isnull().sum()
Empty values in each column after the cleaning:
ID           0
TITLE        0
PUBLISHER    0
CATEGORY     0
TIMESTAMP    0
dtype: int64
data.head()
| ID | TITLE | PUBLISHER | CATEGORY | TIMESTAMP | |
|---|---|---|---|---|---|
| 0 | 1 | Fed official says weak data caused by weather,... | Los Angeles Times | b | 1394470370698 |
| 1 | 2 | Fed's Charles Plosser sees high bar for change... | Livemint | b | 1394470371207 |
| 2 | 3 | US open: Stocks fall after Fed official hints ... | IFA Magazine | b | 1394470371550 |
| 3 | 4 | Fed risks falling 'behind the curve', Charles ... | IFA Magazine | b | 1394470371793 |
| 4 | 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | Moneynews | b | 1394470372027 |
# Maps the category codes to full names:
# b - Business.
# t - Technology.
# e - Entertainment.
# m - Medicine.
def change_category_name(data):
    if data['CATEGORY'] == 'b':
        return 'Business'
    if data['CATEGORY'] == 't':
        return 'Technology'
    if data['CATEGORY'] == 'e':
        return 'Entertainment'
    if data['CATEGORY'] == 'm':
        return 'Medicine'
# Function that changes all the category names.
def change_all_category_names(data):
    data['CATEGORY'] = data.apply(change_category_name, axis=1)
change_all_category_names(data)
# Creates a pie chart of value counts for the given column.
def create_pie_chart_of_count(df, column_name, titles=None):
    df_not_null = df[~df[column_name].isnull()]
    fig = px.pie(df_not_null.groupby([column_name]).size().reset_index(name='count'), labels=titles, names=column_name, values='count')
    fig.show()
#Show all the news of the dataset by category.
create_pie_chart_of_count(data, 'CATEGORY')
# Convert a millisecond timestamp to a dd/mm/yyyy date string.
def timestamp_to_date(ts):
    s = ts / 1000.0
    return datetime.datetime.fromtimestamp(s).strftime('%d/%m/%Y')
# Function that gets the titles and removes all punctuation (Spark RDD map chain).
def remove_punctuation(text):
    rdd = sc.parallelize(text)
    news_rdd = rdd.map(lambda word: word.replace("-", " "))\
                  .map(lambda word: word.replace(",", ""))\
                  .map(lambda word: word.replace(".", ""))\
                  .map(lambda word: word.replace(";", ""))\
                  .map(lambda word: word.replace("”", ""))\
                  .map(lambda word: word.replace('"', ""))\
                  .map(lambda word: word.replace("[", ""))\
                  .map(lambda word: word.replace("]", ""))\
                  .map(lambda word: word.replace("?", ""))\
                  .map(lambda word: word.replace("/", ""))\
                  .map(lambda word: word.replace("$", ""))\
                  .map(lambda word: word.replace("'", ""))\
                  .map(lambda word: word.replace(")", ""))\
                  .map(lambda word: word.replace("(", ""))\
                  .map(lambda word: word.replace(":", ""))
    return news_rdd.collect()
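For reference, the same cleanup can be done on a single machine with `str.translate`; this is a sketch of an equivalent (the same character set, hyphens mapped to spaces, everything else dropped), not part of the Spark pipeline:

```python
# Characters removed outright; '-' is handled separately (replaced by a space).
DROP = ',.;”"[]?/$\'():'

def remove_punctuation_local(titles):
    table = str.maketrans('-', ' ', DROP)
    return [t.translate(table) for t in titles]
```

For a list of about a hundred headlines, this local version avoids the overhead of distributing the work.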
# Function that returns up to 100 articles mentioning a given company, published before the selected date.
def hundred_news_before_date(df, date, company_name):
    timestamp = int(time.mktime(datetime.datetime.strptime(date, "%d/%m/%Y").timetuple())) * 1000
    comp_df = df[df['TITLE'].str.lower().str.contains(company_name.lower())]
    new_df = comp_df[comp_df['TIMESTAMP'] < timestamp]
    return new_df.tail(100)
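The title-and-date filter can be sketched on a toy frame (made-up headlines; timestamps are millisecond epochs, as in the UCI data):

```python
import datetime
import time

import pandas as pd

# Toy frame: two Facebook headlines, one of them after the cutoff date.
toy = pd.DataFrame({
    'TITLE': ['Facebook launches app', 'Facebook earnings beat', 'Apple event'],
    'TIMESTAMP': [1406868286342, 1500000000000, 1406868287385],
})
# Cutoff in milliseconds, mirroring the function above.
cutoff = int(time.mktime(datetime.datetime.strptime('27/08/2014', '%d/%m/%Y').timetuple())) * 1000
hits = toy[toy['TITLE'].str.lower().str.contains('facebook')]
hits = hits[hits['TIMESTAMP'] < cutoff]
```

Only the first headline survives both filters: the Apple row fails the title match, and the 2017 timestamp fails the date cutoff.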
company_news_df = hundred_news_before_date(data, check_date, company_name)
company_news_df
| ID | TITLE | PUBLISHER | CATEGORY | TIMESTAMP | |
|---|---|---|---|---|---|
| 407106 | 407625 | Facebook's 'Internet.org' Effort Reaches Africa | Design \& Trend | Technology | 1406868286342 |
| 407107 | 407626 | Facebook Launches Internet.org App in Zambia | AFKInsider | Technology | 1406868286558 |
| 407108 | 407627 | Facebook unveils free Net app, starting in Zambia | Gulf Today | Technology | 1406868287385 |
| 407109 | 407628 | Facebook Debuts App For Free Web Access In Eme... | MediaPost Communications | Technology | 1406868287986 |
| 407110 | 407629 | Facebook launches free internet app for basic ... | Zee News | Technology | 1406868288262 |
| ... | ... | ... | ... | ... | ... |
| 416011 | 416530 | Fugitive 'who violated his parole' is captured... | Daily Mail | Entertainment | 1409029095025 |
| 416047 | 416566 | Fugitive Arrested After Posting Facebook Video... | Headlines \& Global News | Entertainment | 1409029107770 |
| 416155 | 416674 | Facebook user claims that only 7% of ALS chall... | Daily News \& Analysis | Entertainment | 1409029144538 |
| 416184 | 416703 | #InTheNews: Criminal Who Posted A Video Of His... | 360Nobs.com | Entertainment | 1409029154255 |
| 416204 | 416723 | Roger Federer took the ALS Ice Bucket Challeng... | Tennis Magazine | Entertainment | 1409029160556 |
100 rows × 5 columns
# Function that determines the sentiment of a title (VADER).
def sent_analysis(title):
    sid_obj = SentimentIntensityAnalyzer()
    return sid_obj.polarity_scores(title)
# An example of the aggregate sentiment over the ~100 titles returned by hundred_news_before_date.
sen = sent_analysis(str(remove_punctuation(company_news_df['TITLE'].str.lower())))
del sen['compound']
# Function that prints a sentiment pie chart for the one hundred titles.
def print_sentiments_pie_by_one_hundred_titles():
    sentiment_labels = ["Negative", "Neutral", "Positive"]
    mycolors = ["red", "blue", "green"]
    myexplode = [0, 0, 0.1]
    plt.figure(figsize=(9, 6))
    plt.pie(np.array(list(sen.values())), labels=sentiment_labels, colors=mycolors, explode=myexplode, shadow=True, autopct='%1.2f%%')
    plt.legend(title="Example sentiments of 100 titles:")
    plt.show()
# Sentiment example of 100 titles.
print_sentiments_pie_by_one_hundred_titles()
# Function that plots the category counts of the 100 titles.
def category_graph_countplot(df):
    ax = sns.countplot(x=df)
    ax.set_title("Category Counts")
    ax.set_xlabel("Category")
    plt.show()
# Show the category counts of the 100 titles.
category_graph_countplot(company_news_df['CATEGORY'])
# Creates a bar chart of value counts for the given column.
def create_bar_of_count(df, column_name):
    df_not_null = df[~df[column_name].isnull()]
    fig = px.bar(df_not_null.groupby([column_name]).size().reset_index(name='count'), x=column_name, y='count')
    fig.show()
# Show all the publishers of the 100 articles in company_news_df.
create_bar_of_count(company_news_df, 'PUBLISHER')
# The function gets a date and returns whether the stock went up or down one month later.
# False means DOWN and True means UP.
# Date input format: dd/mm/yyyy.
def up_down_by_date(symbol, start_date):
    start = datetime.datetime.strptime(start_date, "%d/%m/%Y")
    end = start + timedelta(days=30)
    prices = pdr.get_data_yahoo(symbols=symbol, start=start, end=end)
    return bool(prices['Close'].iloc[-1] - prices['Close'].iloc[0] > 0)
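The up/down label boils down to comparing the first and last close in the 30-day window. A sketch on a synthetic price frame (made-up values standing in for what `pdr.get_data_yahoo` returns):

```python
import pandas as pd

# Synthetic stand-in for the price frame; only 'Close' matters for the label.
prices = pd.DataFrame({'Close': [74.9, 75.3, 76.1, 75.8]})
# Last close vs. first close over the window.
went_up = bool(prices['Close'].iloc[-1] - prices['Close'].iloc[0] > 0)
```

Here 75.8 > 74.9, so the label is True even though the price dipped at the end of the window.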
up_down = up_down_by_date(company_symbol, check_date)
if up_down:
    print('Stock went UP')
else:
    print('Stock went DOWN')
Stock went UP
# Initialize list of lists.
first_row = [[80, check_date, sen['neg'], sen['neu'], sen['pos'], company_news_df['PUBLISHER'].mode().values[-1], company_news_df['CATEGORY'].mode().values[-1], up_down]]
# Create the pandas DataFrame with the sentiment, the up/down label, and the dominant publisher and category.
df = pd.DataFrame(first_row, columns = ['ID','Date', 'Neg', 'Neu','Pos', 'DPublisher', 'DCategory', 'Up'])
df
| ID | Date | Neg | Neu | Pos | DPublisher | DCategory | Up | |
|---|---|---|---|---|---|---|---|---|
| 0 | 80 | 27/08/2014 | 0.081 | 0.807 | 0.112 | Wall Street Journal \(blog\) | Technology | True |
# Generates 80 dates, stepping two days back from the selected date.
last_date = check_date
def make_date():
    global last_date
    last_date = datetime.datetime.strptime(last_date, "%d/%m/%Y") - timedelta(days=2)
    last_date = last_date.strftime("%d/%m/%Y")
    return last_date
dates = [make_date() for i in range(80)]
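The same 80 dates can be built without mutating a global; a side-effect-free sketch:

```python
import datetime
from datetime import timedelta

def dates_back(start_date, n, step_days=2):
    # n dates walking backwards from start_date in step_days jumps,
    # formatted dd/mm/yyyy, with no global state.
    d = datetime.datetime.strptime(start_date, '%d/%m/%Y')
    return [(d - timedelta(days=step_days * (i + 1))).strftime('%d/%m/%Y')
            for i in range(n)]

dates = dates_back('27/08/2014', 80)
```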
So far we have shown how we built the first row of the DataFrame. Now we'll load 80 more rows.
In the next cell we build the DataFrame for machine learning in parallel with Spark.
# This cell uses parallel computing via Spark.
# The work is divided among the workers; each worker runs the
# hundred_news_before_date function, which returns up to 100
# articles for a given date. Everything returned by
# hundred_news_before_date is reduced to a single row describing
# the sentiment of all 100 articles together on that date.
# All workers return their rows to the master process, where we
# build the DataFrame for the model analysis.
rdd_dates = spark.sparkContext.parallelize(dates)
id_rdd = spark.sparkContext.parallelize(range(80))
rdd_up_down = rdd_dates.map(lambda x: up_down_by_date(company_symbol, x))
rdd_pub = rdd_dates.map(lambda x: hundred_news_before_date(data, x, company_name)['PUBLISHER'].mode().values[-1])
rdd_cat = rdd_dates.map(lambda x: hundred_news_before_date(data, x, company_name)['CATEGORY'].mode().values[-1])
rdd_news = rdd_dates.map(lambda x: list(hundred_news_before_date(data, x, company_name)['TITLE'].str.lower()))
rdd_sentiments = rdd_news.map(lambda x: sent_analysis(str(x)))
rdd_sentiments = rdd_sentiments.map(lambda x: [x['neg'],x['neu'],x['pos']])
# Collects all the data and creates a new data frame for the master.
id_df = spark.createDataFrame(id_rdd, IntegerType()).toDF("ID").toPandas()
dates_df = spark.createDataFrame(rdd_dates, StringType()).toDF("Date").toPandas()
up_down_df = spark.createDataFrame(rdd_up_down, BooleanType()).toDF("Up").toPandas()
sent_df = spark.createDataFrame(rdd_sentiments).toDF('Neg','Neu','Pos').toPandas()
pub_df = spark.createDataFrame(rdd_pub, StringType()).toDF("DPublisher").toPandas()
cat_df = spark.createDataFrame(rdd_cat, StringType()).toDF("DCategory").toPandas()
new_df = pd.concat([id_df,dates_df,sent_df,pub_df,cat_df, up_down_df], axis=1)
# Add the first row of analyzed news, demonstrated earlier.
new_df.loc[80] = df.values.tolist()[0]
df = new_df
df
| ID | Date | Neg | Neu | Pos | DPublisher | DCategory | Up | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 25/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True |
| 1 | 1 | 23/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True |
| 2 | 2 | 21/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True |
| 3 | 3 | 19/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True |
| 4 | 4 | 17/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 76 | 76 | 26/03/2014 | 0.115 | 0.797 | 0.089 | Times of India | Technology | False |
| 77 | 77 | 24/03/2014 | 0.107 | 0.781 | 0.111 | ValueWalk | Technology | False |
| 78 | 78 | 22/03/2014 | 0.083 | 0.797 | 0.120 | ValueWalk | Technology | False |
| 79 | 79 | 20/03/2014 | 0.085 | 0.797 | 0.118 | ValueWalk | Technology | False |
| 80 | 80 | 27/08/2014 | 0.081 | 0.807 | 0.112 | Wall Street Journal \(blog\) | Technology | True |
81 rows × 8 columns
# Graph that compares up vs. down counts in the DataFrame.
print('compare up vs down in the DataFrame:')
sns.countplot(x='Up', data=df)
plt.show()
compare up vs down in the DataFrame:
# Bar chart of publisher counts for rows where the stock moved up.
print('Positive DataFrame:')
up_df = df[df['Up'] == True]
create_bar_of_count(up_df, 'DPublisher')
Positive DataFrame:
# Bar chart of publisher counts for rows where the stock moved down.
print('Negative DataFrame:')
down_df = df[df['Up'] == False]
create_bar_of_count(down_df, 'DPublisher')
Negative DataFrame:
# Function that prints a heat map of the numeric features.
def graph_heatmap(data):
    plt.figure(figsize=(9, 6))
    cor = np.abs(data.corr())
    sns.heatmap(cor, annot=True, cmap=plt.cm.Reds, vmin=-1, vmax=1)
    plt.show()
# Show heatmap graph.
graph_heatmap(df.drop('ID', axis=1))
# Function that checks whether the sentiment direction agrees with the stock movement.
def check_connection_between_sentiment_up(data):
    if ((data['Pos'] > data['Neg']) and data['Up'] == 1) or ((data['Pos'] < data['Neg']) and data['Up'] == 0):
        return True
    else:
        return False
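The check can be sketched on a toy frame: a row "agrees" when positive sentiment coincides with an upward move, or negative sentiment with a downward one (values made up):

```python
import pandas as pd

toy = pd.DataFrame({'Pos': [0.2, 0.1], 'Neg': [0.1, 0.2], 'Up': [True, True]})
# Row 0: Pos > Neg while the stock went up -> agreement.
# Row 1: Neg > Pos while the stock went up -> disagreement.
agrees = toy.apply(
    lambda r: (r['Pos'] > r['Neg'] and r['Up'] == 1)
              or (r['Pos'] < r['Neg'] and r['Up'] == 0),
    axis=1)
```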
# Apply the hypothesis check.
def apply_hypers(data):
    data['Cbetween_sentiment_up'] = data.apply(check_connection_between_sentiment_up, axis=1)
apply_hypers(df)
df
| ID | Date | Neg | Neu | Pos | DPublisher | DCategory | Up | Cbetween_sentiment_up | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 25/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True | True |
| 1 | 1 | 23/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True | True |
| 2 | 2 | 21/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True | True |
| 3 | 3 | 19/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True | True |
| 4 | 4 | 17/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 76 | 76 | 26/03/2014 | 0.115 | 0.797 | 0.089 | Times of India | Technology | False | True |
| 77 | 77 | 24/03/2014 | 0.107 | 0.781 | 0.111 | ValueWalk | Technology | False | False |
| 78 | 78 | 22/03/2014 | 0.083 | 0.797 | 0.120 | ValueWalk | Technology | False | False |
| 79 | 79 | 20/03/2014 | 0.085 | 0.797 | 0.118 | ValueWalk | Technology | False | False |
| 80 | 80 | 27/08/2014 | 0.081 | 0.807 | 0.112 | Wall Street Journal \(blog\) | Technology | True | True |
81 rows × 9 columns
# Depicting the relationship between sentiment and stock movement using a pie chart.
create_pie_chart_of_count(df, 'Cbetween_sentiment_up')
# One-hot encoding of the categorical values, preparing for the ML process.
def one_hot_enc(df):
    dumC = pd.get_dummies(df.DCategory)
    dumP = pd.get_dummies(df.DPublisher)
    df = pd.concat([df, dumC], axis='columns')
    df = pd.concat([df, dumP], axis='columns')
    return drop_features(df, ['DPublisher', 'DCategory'])
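The `pd.get_dummies` step can be sketched on a toy column: each category becomes its own indicator column, and the original column is dropped:

```python
import pandas as pd

toy = pd.DataFrame({'DCategory': ['Technology', 'Business', 'Technology']})
# Concatenate the indicator columns, then drop the categorical source column.
encoded = pd.concat([toy, pd.get_dummies(toy.DCategory)], axis='columns').drop(['DCategory'], axis=1)
```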
df
| ID | Date | Neg | Neu | Pos | DPublisher | DCategory | Up | Cbetween_sentiment_up | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 25/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True | True |
| 1 | 1 | 23/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True | True |
| 2 | 2 | 21/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True | True |
| 3 | 3 | 19/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True | True |
| 4 | 4 | 17/08/2014 | 0.019 | 0.833 | 0.149 | The Independent | Technology | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 76 | 76 | 26/03/2014 | 0.115 | 0.797 | 0.089 | Times of India | Technology | False | True |
| 77 | 77 | 24/03/2014 | 0.107 | 0.781 | 0.111 | ValueWalk | Technology | False | False |
| 78 | 78 | 22/03/2014 | 0.083 | 0.797 | 0.120 | ValueWalk | Technology | False | False |
| 79 | 79 | 20/03/2014 | 0.085 | 0.797 | 0.118 | ValueWalk | Technology | False | False |
| 80 | 80 | 27/08/2014 | 0.081 | 0.807 | 0.112 | Wall Street Journal \(blog\) | Technology | True | True |
81 rows × 9 columns
# Activation of the encoding.
new_df = one_hot_enc(df)
new_df
| ID | Date | Neg | Neu | Pos | Up | Cbetween_sentiment_up | Business | Technology | AllFacebook | ... | The FA Daily | The Independent | The Next Web | Times of India | Ubergizmo | ValueWalk | WKRB News | Wall Street Journal \(blog\) | Washington Post \(blog\) | gamrReview | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 25/08/2014 | 0.019 | 0.833 | 0.149 | True | True | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 23/08/2014 | 0.019 | 0.833 | 0.149 | True | True | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 21/08/2014 | 0.019 | 0.833 | 0.149 | True | True | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 19/08/2014 | 0.019 | 0.833 | 0.149 | True | True | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 17/08/2014 | 0.019 | 0.833 | 0.149 | True | True | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 76 | 76 | 26/03/2014 | 0.115 | 0.797 | 0.089 | False | True | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 77 | 77 | 24/03/2014 | 0.107 | 0.781 | 0.111 | False | False | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 78 | 78 | 22/03/2014 | 0.083 | 0.797 | 0.120 | False | False | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 79 | 79 | 20/03/2014 | 0.085 | 0.797 | 0.118 | False | False | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 80 | 80 | 27/08/2014 | 0.081 | 0.807 | 0.112 | True | True | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
81 rows × 35 columns
# Split the train and the test.
train_df, test_df = train_test_split(new_df, test_size=0.35, random_state=42)
# Show the train_df.
train_df.head(20)
| ID | Date | Neg | Neu | Pos | Up | Cbetween_sentiment_up | Business | Technology | AllFacebook | ... | The FA Daily | The Independent | The Next Web | Times of India | Ubergizmo | ValueWalk | WKRB News | Wall Street Journal \(blog\) | Washington Post \(blog\) | gamrReview | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | 39 | 08/06/2014 | 0.054 | 0.833 | 0.114 | False | False | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 56 | 56 | 05/05/2014 | 0.043 | 0.783 | 0.174 | True | True | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 80 | 80 | 27/08/2014 | 0.081 | 0.807 | 0.112 | True | True | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | 7 | 11/08/2014 | 0.019 | 0.833 | 0.149 | True | True | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 50 | 50 | 17/05/2014 | 0.136 | 0.817 | 0.048 | True | False | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 53 | 53 | 11/05/2014 | 0.097 | 0.785 | 0.118 | True | True | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 19 | 19 | 18/07/2014 | 0.162 | 0.761 | 0.077 | True | False | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 66 | 66 | 15/04/2014 | 0.055 | 0.847 | 0.099 | False | False | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25 | 25 | 06/07/2014 | 0.113 | 0.836 | 0.050 | True | False | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 44 | 44 | 29/05/2014 | 0.085 | 0.888 | 0.027 | True | False | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 13 | 30/07/2014 | 0.039 | 0.883 | 0.078 | True | True | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 77 | 77 | 24/03/2014 | 0.107 | 0.781 | 0.111 | False | False | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 3 | 19/08/2014 | 0.019 | 0.833 | 0.149 | True | True | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 17 | 22/07/2014 | 0.004 | 0.932 | 0.063 | True | True | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 38 | 38 | 10/06/2014 | 0.054 | 0.833 | 0.114 | False | False | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 8 | 09/08/2014 | 0.019 | 0.833 | 0.149 | True | True | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 65 | 65 | 17/04/2014 | 0.045 | 0.891 | 0.064 | False | False | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 6 | 13/08/2014 | 0.019 | 0.833 | 0.149 | True | True | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36 | 36 | 14/06/2014 | 0.015 | 0.900 | 0.084 | True | True | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 72 | 72 | 03/04/2014 | 0.043 | 0.859 | 0.097 | True | True | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 rows × 35 columns
# Split independent and dependent features.
X = train_df.drop(['Up', 'Date'], axis=1)
y = train_df['Up']
# Split the train and the test for ml.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35)
# Drop the dates and stock movements.
data_test=test_df.drop(['Up','Date'], axis=1)
# Logistic Regression.
lr = LogisticRegression(random_state=42, solver="liblinear")
lr.fit(X_train, y_train)
lr_predict = lr.predict(data_test)
# SGDClassifier.
sgd = make_pipeline(StandardScaler(),SGDClassifier(max_iter=1000,random_state=42))
sgd.fit(X_train, y_train)
sgd_predict = sgd.predict(data_test)
# MLPClassifier.
mlp = make_pipeline(StandardScaler(),MLPClassifier(max_iter=500))
mlp.fit(X_train, y_train)
mlp_predict = mlp.predict(data_test)
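All three models follow the same scale-then-fit pipeline pattern. A sketch on a tiny synthetic, linearly separable set (not the news features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic one-feature data: small values are class 0, large values class 1.
X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])
# StandardScaler is fitted only on the training data, then applied at predict time.
clf = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
clf.fit(X, y)
pred = clf.predict([[0.05], [0.95]])
```

The pipeline keeps scaling and classification as one object, so the scaler's statistics are never leaked from test to train.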
# LR model accuracy.
accuracy = accuracy_score(test_df['Up'], lr_predict)
print('LR accuracy - ', accuracy)
# SGD model accuracy.
accuracy = accuracy_score(test_df['Up'], sgd_predict)
print('SGD accuracy - ', accuracy)
# MLP model accuracy.
accuracy = accuracy_score(test_df['Up'], mlp_predict)
print('MLP accuracy - ', accuracy)
LR accuracy -  0.9310344827586207
SGD accuracy -  0.896551724137931
MLP accuracy -  0.8275862068965517
# Try model on test dataset.
predict_test = lr_predict
# Show the prediction.
comparing_df = test_df.loc[:, test_df.columns.intersection(['ID','Date','Up'])]
comparing_df.insert(3, 'Prediction', predict_test)
comparing_df
| ID | Date | Up | Prediction | |
|---|---|---|---|---|
| 30 | 30 | 26/06/2014 | True | True |
| 0 | 0 | 25/08/2014 | True | True |
| 22 | 22 | 12/07/2014 | True | True |
| 31 | 31 | 24/06/2014 | True | True |
| 18 | 18 | 20/07/2014 | True | True |
| 28 | 28 | 30/06/2014 | True | True |
| 10 | 10 | 05/08/2014 | True | True |
| 70 | 70 | 07/04/2014 | True | True |
| 4 | 4 | 17/08/2014 | True | True |
| 12 | 12 | 01/08/2014 | True | True |
| 49 | 49 | 19/05/2014 | True | True |
| 33 | 33 | 20/06/2014 | True | True |
| 67 | 67 | 13/04/2014 | True | True |
| 35 | 35 | 16/06/2014 | True | True |
| 68 | 68 | 11/04/2014 | False | False |
| 45 | 45 | 27/05/2014 | True | True |
| 73 | 73 | 01/04/2014 | False | True |
| 61 | 61 | 25/04/2014 | True | True |
| 55 | 55 | 07/05/2014 | True | True |
| 40 | 40 | 06/06/2014 | True | True |
| 9 | 9 | 07/08/2014 | True | True |
| 64 | 64 | 19/04/2014 | False | False |
| 5 | 5 | 15/08/2014 | True | True |
| 47 | 47 | 23/05/2014 | True | True |
| 34 | 34 | 18/06/2014 | True | True |
| 62 | 62 | 23/04/2014 | False | False |
| 42 | 42 | 02/06/2014 | True | True |
| 54 | 54 | 09/05/2014 | True | True |
| 16 | 16 | 24/07/2014 | False | True |
# Print the classification report of the prediction.
eval_report = classification_report(test_df['Up'], predict_test, target_names=['0: Down', '1: Up'], output_dict=True)
edf = pd.DataFrame(eval_report).T
edf
| precision | recall | f1-score | support | |
|---|---|---|---|---|
| 0: Down | 1.000000 | 0.600000 | 0.750000 | 5.000000 |
| 1: Up | 0.923077 | 1.000000 | 0.960000 | 24.000000 |
| accuracy | 0.931034 | 0.931034 | 0.931034 | 0.931034 |
| macro avg | 0.961538 | 0.800000 | 0.855000 | 29.000000 |
| weighted avg | 0.936340 | 0.931034 | 0.923793 | 29.000000 |
# Creating a data frame from parsed_data (the latest 100 articles).
df_api = pd.DataFrame(parsed_data, columns=['ID', 'Title', 'Date', 'Time', 'Publisher'])
df_api.insert(3, 'Category', 'Technology')
# Re-tags a title as Business when it contains a finance-related keyword.
def update_category(df, word):
    df.loc[(df['Title'].str.lower().str.contains(word)), 'Category'] = "Business"
    return df
words = ['buy', 'sell', 'deal', 'sale', 'dollar', 'cash', 'rich', 'business', 'market', 'stock']
for w in words:
    df_api = update_category(df_api, w)
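The keyword tagging can be sketched on made-up headlines: anything matching a finance keyword is moved from Technology to Business, everything else keeps its default:

```python
import pandas as pd

# Toy headlines tagged Technology by default.
toy = pd.DataFrame({'Title': ['Meta stock slides', 'New VR headset ships'],
                    'Category': ['Technology', 'Technology']})
for w in ['buy', 'sell', 'stock']:
    toy.loc[toy['Title'].str.lower().str.contains(w), 'Category'] = 'Business'
```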
# Displaying data frame.
df_api
| ID | Title | Date | Category | Time | Publisher | |
|---|---|---|---|---|---|---|
| 0 | 0 | Liberty and Meta announce expansion of renewab... | 10/05/2022 | Technology | 05:05PM | CNW Group |
| 1 | 1 | Elon Musk gets Twitters Trump ban all wrong | 10/05/2022 | Technology | 04:37PM | Quartz |
| 2 | 2 | Strap on that headset. University of Maryland ... | 10/05/2022 | Technology | 03:15PM | American City Business Journals |
| 3 | 3 | How CEO pay compares to median salaries at App... | 10/05/2022 | Technology | 02:38PM | American City Business Journals |
| 4 | 4 | Elon Musk says he would reverse Twitter's ban ... | 10/05/2022 | Technology | 02:16PM | LA Times |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 95 | Paramount CFO: We aren't in the same leaky boa... | 03/05/2022 | Technology | 02:33PM | Yahoo Finance |
| 96 | 96 | Facebook Sharpens Its Metaverse Weapons | 03/05/2022 | Technology | 02:20PM | TheStreet.com |
| 97 | 97 | Senators Seek to Loosen Googles Grip on Digita... | 03/05/2022 | Business | 01:38PM | Bloomberg |
| 98 | 98 | Facebooks E-Commerce Bet Stumbles as Meta Look... | 03/05/2022 | Business | 01:21PM | The Wall Street Journal |
| 99 | 99 | Austin mayor to Elon Musk: We would welcome Tw... | 03/05/2022 | Technology | 01:06PM | Yahoo Finance |
100 rows × 6 columns
# Removing extra whitespace from the df_api Publisher column.
df_api['Publisher'] = df_api['Publisher'].apply(lambda x: x.strip())
# Finding the most dominant publisher of today's latest news.
dpub = df_api['Publisher'].mode().values[-1]
temp = df_api
# Did we train on this publisher? If not, fall back to the next most dominant one.
while True:
    if dpub in set(df.DPublisher):
        break
    temp = temp[temp['Publisher'] != dpub]
    dpub = temp['Publisher'].mode().values[-1]
# Processing the latest articles.
sentiment = sent_analysis(str(remove_punctuation(df_api['Title'].str.lower())))
# Initialize list of lists.
today = datetime.datetime.today().strftime('%d/%m/%Y')
data1 = [[0, today, sentiment['neg'],sentiment['neu'],sentiment['pos'], dpub, df_api['Category'].mode().values[-1]]]
# Create the pandas DataFrame with the sentiment and the dominant publisher and category.
df_final = pd.DataFrame(data1, columns = ['ID','Date', 'Neg','Neu','Pos', 'DPublisher', 'DCategory'])
df_final
| ID | Date | Neg | Neu | Pos | DPublisher | DCategory | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 11/05/2022 | 0.094 | 0.83 | 0.076 | Reuters | Technology |
# Encoding the latest articles: reuse an encoded training row with the same publisher and overwrite its sentiment.
publisher = df_final.DPublisher[0]
prediction_val = new_df[new_df[publisher] == 1].head(1).drop(['Up','Date'], axis=1)
prediction_val['Neg'], prediction_val['Neu'], prediction_val['Pos'] = sentiment['neg'],sentiment['neu'],sentiment['pos']
prediction_val['ID'] = 82
prediction_val
| ID | Neg | Neu | Pos | Cbetween_sentiment_up | Business | Technology | AllFacebook | Bloomberg | Economic Times | ... | The FA Daily | The Independent | The Next Web | Times of India | Ubergizmo | ValueWalk | WKRB News | Wall Street Journal \(blog\) | Washington Post \(blog\) | gamrReview | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42 | 82 | 0.094 | 0.83 | 0.076 | False | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 rows × 33 columns
# Final prediction of rise or fall for the current date.
# Up = True, Down = False.
final_prediction = lr.predict(prediction_val)
if final_prediction[0]:
    print(company_name, 'will go UP in a month from now.')
else:
    print(company_name, 'will go DOWN in a month from now.')
Facebook will go DOWN in a month from now.